Recent Advances in Vision Foundation Models


In conjunction with CVPR 2024

June 17th, 2024, 9:00 a.m. – 5:00 p.m. PDT

Location: Summit 437–439, Seattle Convention Center


Visual understanding at different levels of granularity has been a longstanding problem in the computer vision community. These tasks range from image-level tasks (e.g., image classification, image-text retrieval, image captioning, and visual question answering) and region-level localization tasks (e.g., object detection and phrase grounding) to pixel-level grouping tasks (e.g., instance, semantic, and panoptic segmentation). Until recently, most of these tasks were tackled separately with specialized model designs, preventing the synergy across tasks of different granularities from being exploited.

In light of the versatility of transformers and inspired by large-scale vision-language pre-training, the computer vision community is now witnessing a growing interest in building general-purpose vision systems, also called vision foundation models, that can learn from and be applied to a wide range of downstream tasks, from image-level and region-level to pixel-level vision tasks.

In this tutorial, we will cover the most recent approaches and principles at the frontier of learning and applying vision foundation models, including (1) Learning Vision Foundation Models for Multimodal Understanding and Generation; (2) Benchmarking and Evaluating Vision Foundation Models; and (3) Agents and other Advanced Systems built on Vision Foundation Models.

Program (PDT, UTC-7)

You are welcome to join our tutorial either in person or virtually via Zoom (see the CVPR 2024 portal for the Zoom link).

Morning Session
9:00 - 9:20 Opening Remarks   [Slides]   [Bilibili, YouTube] Lijuan Wang
9:20 - 10:10 Large Multimodal Models: Towards Building General-Purpose Multimodal Assistant   [Slides]   [Bilibili, YouTube] Chunyuan Li
10:10 - 11:00 Methods, Analysis & Insights from Multimodal LLM Pre-training   [Slides]   [Bilibili, YouTube] Zhe Gan
11:00 - 11:50 LMMs with Fine-Grained Grounding Capabilities   [Slides]   [Bilibili, YouTube] Haotian Zhang
Afternoon Session
13:00 - 13:50 A Close Look at Vision in Large Multimodal Models   [Slides]   [Bilibili, YouTube] Jianwei Yang
13:50 - 14:40 Multimodal Agents   [Slides]   [Bilibili, YouTube] Linjie Li
14:40 - 15:00 Coffee Break & QA  
15:00 - 15:50 Recent Advances in Image Generative Foundation Models   [Slides]   [Bilibili, YouTube] Zhengyuan Yang
15:50 - 16:40 Video and 3D Generation   [Slides]   [Bilibili, YouTube] Kevin Lin
16:40 - 17:00 Closing Remarks & QA  

Organizers

Chunyuan Li

TikTok

Zhe Gan

Apple

Jianwei Yang

Microsoft

Linjie Li

Microsoft

Zhengyuan Yang

Microsoft

Kevin Lin

Microsoft

Jianfeng Gao

Microsoft

Lijuan Wang

Microsoft

Contacts

Contact the Organizing Committee: vlp-tutorial@googlegroups.com